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ABSTRACT 

Deciphering the structure of gene regulatory 
networks across the tree of life remains one of the 
major challenges in postgenomic biology. We 
present a novel ChlP-seq workflow for the archaea 
using the model organism Halobacterium salinarum 
sp. NRC-1 and demonstrate its application for 
mapping the genome-wide binding sites of natively 
expressed transcription factors. This end-to-end 
pipeline is the first protocol for ChlP-seq in 
archaea, with methods and tools for each stage 
from gene tagging to data analysis and biological 
discovery. Genome-wide binding sites for transcrip- 
tion factors with many binding sites (TfbD) 
are identified with sensitivity, while retaining speci- 
ficity in the identification the smaller regulons 
(bacteriorhodopsin-activator protein). Chromosomal 
tagging of target proteins with a compact epitope 
facilitates a standardized and cost-effective 
workflow that is compatible with high-throughput 
immunoprecipitation of natively expressed tran- 
scription factors. The Pique package, an open- 
source bioinformatics method, is presented for 
identification of binding events. Relative to ChlP- 
Chip and qPCR, this workflow offers a robust 
catalog of protein-DNA binding events with 
improved spatial resolution and significantly 
decreased cost. While this study focuses on the 
application of ChlP-seq in H. salinarum sp. 
NRC-1, our workflow can also be adapted for use 
in other archaea and bacteria with basic genetic 
tools. 



INTRODUCTION 

The dynamic modulation of gene expression is an import- 
ant mechanism that allows organisms to sense and 
respond to changes in their environment. These changes 
in expression profiles are mediated by dynamic associ- 
ations of transcription factors and their cognate regula- 
tory regions, collectively known as gene-regulatory 
networks (GRNs) (1). Regulatory networks integrate 
complex cellular and environmental cues, orchestrating 
intricate phenotypes essential for physiology and develop- 
ment. The evolutionary rewiring of these regulatory 
circuits is thought to be an important driver of speciation 
(2). Elucidating the structure and function of GRNs is 
therefore a major research initiative in functional 
genomics and systems biology (3-8). 

The characterization of GRN architecture has been 
driven by advances in experimental and computational 
methods for identifying genome-wide protein-DNA inter- 
actions (9-13). One such approach is chromatin 
immunoprecipitation (IP) coupled with high-throughput 
sequencing (ChlP-seq), a method that provides quantita- 
tive genome-wide mapping of target protein-binding 
events. ChlP-seq identifies protein-binding sites with 
improved spatial resolution and decreased cost relative 
to previous microarray-based ChlP-chip technologies 
(10). While ChlP-seq has become a widely used tool in 
eukaryotic systems, this method has been applied only 
once in a bacterial system (14) and there exist no instances 
of such work in archaea. The small size of bacterial and 
archaeal genomes makes this high-throughput sequence 
technology particularly attractive, as sample multiplexing 
can be used to dramatically reduce costs relative to 
microarray-based platforms. 

Developing a ChlP-seq protocol for archaea would 
stimulate high-throughput characterization of GRNs, 
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which are a nascent area of study relative to work in the 
other two domains of hfe. Archaea are essential drivers of 
global biogeochemical cycling, integral players in indus- 
trial applications and biomedically important organisms. 
Furthermore, the transcriptional apparatus of archaea 
exhibits properties of both eukaryotic and bacterial 
systems, making it an intriguing target for investigating 
basic principles of regulatory mechanisms across the tree 
of life (15). Improved understanding of archaeal informa- 
tion processing and transcriptional regulation has wide- 
spread applicability. 

We present a novel ChlP-seq workflow for the archaea 
using the model organism Halobacterium salinarum sp. 
NRC-1 {Hb. NRC-1) and demonstrate its application for 
mapping the genome-wide binding sites of natively ex- 
pressed transcription factors. Previous bacterial and 
archaeal ChIP methods have taken different approaches 
involving either costly protein-specific antibodies against 
native proteins (14) or a standard antibody against 
epitope-tagged target proteins that are constitutively 
overexpressed from a heterologous plasmid (16,17). This 
protocol combines these methods by employing a single, 
commercially available antihemagglutinin (HA) antibody 
against natively expressed recombinant target proteins. 
This ChlP-seq method maintains sensitivity and specifi- 
city with as little as ~lml of the typical bacterial or 
archaeal culture, making it suitable for high-throughput 
analyses. Multiplexing of samples during sequencing sig- 
nificantly decreases experimental costs relative to previous 
ChlP-chip methods, without diminishing sensitivity and 
specificity. 

A complimentary bioinformatics method is presented 
for user-friendly identification of binding events using 
the Pique python package. Integration with the Gaggle 
toolkit streamlines the exploration and analysis of 
putative protein-binding sites (18,19). This end-to-end 
workflow for ChlP-seq of natively expressed proteins 
provides a suitable platform for large-scale studies of the 
structure and dynamic remodeling of GRNs. The first 
protocol of its kind for archaea, this method can be 
adapted for work in all bacteria and archaea with 
suitable genetic tools. 

MATERIALS AND METHODS 

Construction of the pRSKOl ligation-independent cloning 
vector for Halobacterium salinarum NRC-1 

The plasmid pNBKOT (obtained from N. Baliga, Institute 
for Systems Biology, Seattle, WA) has been previously 
used to create targeted gene knockouts (17,20-22) in the 
Hb. NRC-1 ApyrF uracil auxotroph strain (Hb. NRC-1 
ApyrF). For this study, pNBKO? was modified to facihtate 
Gateway Hgation-independent cloning of segments of 
DNA suitable for chromosomal modification by homolo- 
gous recombination. Plasmid pNBKO? (sequence and 
maps in Supplementary Information) was digested with 
StuI (New England Biolabs cat. #R0187S, Ipswitch, 
MA), a blunt-end cutting restriction endonuclease. The 
digested vector was subsequently dephosphorylated 
with calf intestinal alkahne phosphates (New England 



Biolabs cat. #M0290S) and purified via agarose gel extrac- 
tion. PGR primers ml3F and ml3R (Supplementary 
Table SI) were used to ampHfy a fragment of the 
pDONR221 vector containing an attFl recombination 
site, the ccdB gene, a chloramphenicol resistance marker 
and an attFl recombination site. This PGR product was 
ligated into the Stul-digested pNBKO? backbone to create 
the Gateway (Invitrogen, Garlsbad, GA) compatible 
pRSKOl vector. The pRSKOl vector was sequenced 
using the pNBKOTF, pNBKOTR, ccdB500F and 
ccdB500B primers. Sequences (plasmid and oligonucleo- 
tide) and plasmid maps are provided in Supplementary 
Information. 

Construction of chromosomally tagged 
transcription factors 

Ghromosomally epitope-tagged transcription factors were 
made by way of site-specific homologous recombination in 
Hb. NRC-1 ApyrF. This method is analogous to that used 
for making in-frame gene knockouts using the pNBKO? 
vector as previously described (22). Two different 
approaches were utilized to make the epitope-tagging con- 
structs for this study. In the first approach, PGR-mediated 
^pHcing by overlap extension (SOEing) (23) was used to 
join two PGR products used to add a sequence encoding 
a region ~ 500 bp upstream of the bacteriorhodopsin- 
activator protein (bat) stop codon, an HA epitope coding 
sequence, a new stop codon and ~500bp downstream of 
the bat genomic stop codon. PGR primers are Hsted in 
Supplementary Table SI. This PGR product was cloned 
into the StuI site of plasmid pNBKOT, which was subse- 
quently transformed into strain Hb. NRC-1 ApyrF. A 
two-step, double-crossover process was followed to 
chromosomally insert the HA epitope. First crossover re- 
combinants were selected by plating on 2% (w/v) complete 
media (GM) agar plates containing 20 |ig/ml mevinolin. 
Second crossover recombinants were enriched by selecting 
on 2% GM agar plates containing 300 |ig/ml 5-fluoroorotic 
acid (5-FOA). The absence of a functional pyrF gene is 
required for survival on 5-FOA, indicating loss of plasmid. 

The second method for chromosomal epitope tagging 
was used to tag the general transcription factor tfbD. This 
method takes advantage of commercial DNA synthesis 
technologies and the new Gateway cloning compatible 
pRSKOl vector. In this case, a construct consisting of an 
attBl recombination site, ~500bp upstream of the tfbD 
stop codon, the sequence encoding an HA epitope tag, a 
stop codon, ~500bp downstream of the tfbD chromo- 
somal stop codon, and an attB2 recombination site were 
directly synthesized by Geneart (Invitrogen, Garlsbad, 
GA) and delivered, cloned, in a pANY backbone vector 
also encoding an ampicillin-resistance marker. This vector 
was used directly in an in vitro recombination reaction 
(Gateway cloning, Invitrogen, Garlsbad, GA) with the 
pRSKOl vector according to manufacturer's protocols to 
move the synthetic construct into pRSKOl. Once the syn- 
thetic construct is inserted into pRSKOl, the rest of the 
tagging procedure is identical to that used for pNBKOT- 
based tagging. We have also used a combination of 
SOEing and Gateway recombination to directly clone 
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PCR products, flanked by appropriate attB recombination 
sites, directly into pRSKOl (data not shown). 

Verification of chromosomal tagging 

The insertion of HA epitopes at the C-terminal ends of the 
chromosomally encoded bat and tfbD genes was verified 
both by PCR and DNA sequencing. The following PCR 
reactions summarized in Supplementary Figure S2 were 
conducted to verify insertion of the HA epitope. 

The initial PCR screen (Reaction 1) verified the 
presence of the C-terminally tagged gene of interest in 
the cell using a forward primer {ci_gene_of_interestjdi_F) 
located in the gene and a reverse primer complimentary to 
the HA epitope tag's sequence (HA_epiope_R). PCR 
products of the expected size indicate either successful 
chromosomal tagging or the presence of residual tagging 
vector in the cell. 

Recombinant strains were also screened for the presence 
of chromosomally encoded pyrF using primers flanking 
the chromosomally encoded gene (k_vngl67g3_Q_F and 
\L_vngl673g_d_K, Reaction 2). The presence of 
chromosomally encoded pyrF yields a 2050 bp PCR 
product, while the disrupted pyrF in the Hb. NRC-1 
ApyrF strain yields a PCR product of 712 bp. Reaction 
2 was performed to verify that the pyrF gene from the 
plasmid has not reintegrated into the chromosome of the 
Hb. NRC-1 ApyrF strain. 

A second pyrF PCR screening (Reaction 3) was carried 
out to confirm that the plasmid carrying the pyrF hsid been 
cured and that pyrF had not recombined into the chromo- 
some of the Hb. NRC-1 A/^jrF strain. Reaction 3 amplifies 
a region from 465 bp upstream of the pyrF stop codon to 
70 bp downstream of the pyrF stop codon (primers 
yi_vngl673g_g_F and k_vngl673g_h_R). This final 
reaction yields no product in the Hb. NRC-1 ApyrF 
strain and its derivatives. In strains that do carry the 
pyrF gQUQ, such as wild type Hb. NRC-1, or strains trans- 
formed with either the pNBK07 or pRSKOl plasmids, a 
535 bp product is formed. 

Finally, we screen specifically for plasmid-encoded 
copies of the pyrF gene using primers that amplify a 
segment of the pyrF encoded by the pNBK07 or 
pRSKOl vectors (k_vngl673g_g_F and o_pNBK07_a_R, 
Reaction 4). The absence of product confirms that the 
plasmid has been cured, when the reaction is run in con- 
junction with plasmid-containing positive control. 

PCR products derived from PCR reaction using primers 
ci_gene_of_interest_di_F and ci_gene_of_interest_d_R on 
strains meeting all the criteria established by the verifica- 
tion Reactions 1-4 above were sequenced via standard 
Sanger sequencing to verify the integration of the HA 
epitope. Sequences are provided in the Supplementary 
Information. Tag integration was further verified by 
analyzing the genome re-sequencing data for each strain 
that was generated in the process of the ChlP-seq 
experiment. 

Culture preparation 

All cultures were grown in the standard CM for 
H. salinarum (250 g/1 NaCl, 20g/l MgS04, 2g/l KCl, 



3g/L Na-Citrate, 10 g/1 Oxoid peptone (Oxoid cat# 
LP0034) (Oxoid, Basingstoke, UK) supplemented with 
50mg/l uracil and filled to volume with distilled water. 
Cultures were revived from — 80°C freezer stocks and 
were streaked on agar plates. Starter cultures were 
inoculated from individual colonies and allowed to reach 
an optical density at 600 nm of 0.7 before inoculating a 
culture at a starting optical density at 600 nm of 0.03. Cells 
were grown under ambient light conditions in unbaffled 
flasks in a volume equal to 25% of the flask's maximum 
volume. All cultures were grown at 37 C and shaken at 
150rpm on a New Brunswick G-53 orbital shaker (New 
Brunswick, Edison, NJ). 

Immunoprecipitation 

Biological repHcates were conducted as inoculations in 
separate but identical volumes on the same orbital 
shaker. Cells were harvested at an OD 600 between 0.9 
and 1.0, which corresponds to early stationary phase for 
Hb. NRC-1. Cells were immediately fixed with 1% (v/v) 
formaldehyde for lOmin. Fixing was stopped through the 
addition of glycine to a final concentration of 125 mM. 
Batches of 1.75x10^^ cells were removed and pelleted at 
5000 xg. Cell pellets were washed twice with citrate-free 
basal salts, after which the pellets were frozen at — 80°C. 
Though this was our standard input for the IP reaction, 
we also examined the effect of decreasing the number of 
input cells for the IP reaction using the tfbD::HA strain. In 
addition to 1.75 x 10^^ ceUs, these scaling experiments also 
used 8.75 x 10^ 3.50 x 10^ 1.75 x 10^ and 3.50 x 10^ cells 
as input material for IP. The rest of the method is 
described in detail for 1.75x10^^ cells. Appropriate 
volumes and quantities of reagents for the scaled-down 
experiments are reported in Supplementary Table S3. 

Cell pellets were resuspended in 1.6 ml of lysis buffer 
(50 mM HEPES, 140 mM NaCl, 1 mM EDTA, 1% (v/v) 
Triton X-100, 0.1% (w/v) sodium deoxycholate, pH 7.5) 
containing protease inhibitors (Roche cat# 04693159001). 
Resuspended pellets were sonicated using a Bioruptor 
(Diagenode, Denville, NJ) until DNA fragment size 
reached an average of ~500bp (2-7.5-min cycles, 30 s 
on/30 s off, high power setting). 

Cell lysate was combined with 1 |ig of anti-HA antibody 
(Abeam cat# ab9110) (Abeam, Cambridge, MA) and 
protein A-conjugated Dynabeads (Invitrogen cat. # 
100. 2D) preblocked with 5mg/ml BSA in phosphate- 
buffered saline and incubated overnight at 4°C. 
Dynabeads were washed two times with the lysis buffer, 
two times with 1 ml of the lysis buffer supplemented 
with 500 mM NaCl, two times with 1 ml wash buffer 
(10 mM Tris, 250 mM LiCl, 0.5% NP-40 (v/v), 0.5% 
Na-deoxycholate (w/v), 1 mM EDTA, pH 8.0) and one 
time with 1 ml TE buffer. Enriched ChIP DNA/ transcrip- 
tion factor complexes were eluted by the addition of 50 |il 
elution buffer (50 mM Tris, 10 mM EDTA, 1% SDS 
(w/v), pH 8.0) and incubation at 65°C for lOmin. 
Cross-links were reversed by incubating in TE/SDS 
(10 mM Tris, 1 mM EDTA, 1% SDS) overnight at 65°C. 
RNA was digested and DNA sample was subsequently 
prepared for lUumina single read sequencing. 
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ChlP-seq library preparation 

Individual ChIP samples were blunt ended with T4 DNA 
polymerase (NEB cat. # M0203L), Klenow large fragment 
(NEB cat. # M0210L) and T4 polynucleotide kinase (NEB 
cat. # M0201L) at 20°C for 30min. Blunt-ended DNA was 
y A tailed with 3^->5^ exo- Klenow fragment (NEB cat. # 
M0212L) for 30min at 37°C. Adapters containing 6 bp 
barcodes were Hgated to the prepared ChIP DNA 
samples for 15min at room temperature with T4 DNA 
ligase (Enzymatics cat. # L603-HC-L). Barcode sequences 
are provided Supplementary Table S4. A background 
control of whole cell extract genomic DNA from each 
sample was prepared as mentioned above. Samples were 
then used as template for an 18-cycle PGR ampHfication. 
PGR products were quantified and visualized with a 
high-sensitivity DNA chip (Agilent cat. # 5067-4626) on 
a bioanalyzer (Agilent, Santa Glara, GA). GhIP and back- 
ground libraries were pooled in equimolar concentrations 
and loaded onto a single Illumina lane. 

Western blotting 

Transcription factors were immunoprecipitated under the 
same conditions as GhIP methods mentioned above. IP 
samples were run in one dimension on 4-12%, 1.5mm 
polyacrylamide gel (Invitrogen cat. # NP0335) in MOPS 
buffer (Invitrogen cat. # NPOOOl). Protein was then 
blotted onto a 0.2 -|im pore size PVDF membrane 
(Invitrogen cat. # LG2002) at 30 V for one hour in 
transfer buffer (25 mM Tris, 192 mM glycine, 10% (v/v) 
methanol, pH 8.4). PVDF was blocked in 0.5% (w/v) 
casein overnight and subsequently probed with HRP- 
conjugated anti-HA antibody (Abeam cat. # abl265). 
The blot was incubated with GE EGL plus reagents 
(Amersham cat. # RPN2132) according to the manufac- 
turer's suggestions and exposed to light-sensitive film. 

qPCR verification 

qPGR was performed on a Bio-Rad Ghromo 4 Real-Time 
Detector (Bio-Rad, Hercules, GA) using KAPA SYBR® 
FAST Universal 2x qPGR master mix (Kapa Biosystems 
cat.# KK4601) (Kapa Biosystems, Woburn, MA) accord- 
ing to the supplied protocol. Primer sets for enriched 
regions and negative regions were designed using 
known enriched sites and unenriched sites, respectively, 
from previous GhlP-seq and GhlP-chip data (see 
Supplementary Table SI for primer sequences). Fold en- 
richment above background was calculated as 2 to the 
power of cycle threshold difference between a 
non-enriched region and an expected enriched site. WGE 
extract, GhIP samples and amplified libraries were all used 
as template for a qPGR reaction. These were all confirmed 
by comparing to a set of GhlP-control reactions on the Hb 
NRC-1 ApyrF strain. 

Sequencing, processing and ChlP-seq peak calling 

Multiplexed samples were sequenced to 40 bp on the 
Illumina GA-II. Sequences were barcode sorted and 
quality trimmed (minimum Phred quality 20, minimum 
length 25 bp) using the FASTX-Toolkit (http:// 



hannonlab.cshl.edu/fastx_toolkit/) (Gordon,A and 
Hannon,G.J., unpubHshed results). Sequencing primer 
and adapter contamination were filtered using the 
TagDust package (24). QuaHty-filtered reads were 
mapped using Bowtie (25) to the Hb. NRC-1 reference 
genome with repeat sequences masked, and SAM format 
sequence files were converted to sorted BAM files using 
the samtools package (26). 

Putative protein-DNA binding events were detected 
using Pique, a novel microbially focused and freely 
available peak calling appHcation (available at https:// 
github.com/ryneches/pique, version tag: halo_egw) 
(Neches,R.Y., Wilbanks,E.G. and Facciotti,M.T., in 
preparation). Pique is written in Python and makes use 
of the SciPy signal-processing subroutines (27). Pique is 
able to operate on systems that have genomic complexities 
such as IS elements, gene dosage polymorphisms and ac- 
cessory genomes that cause coverage variations unrelated 
to GhIP, or in cases where the organism under study is not 
identical to the reference genome. The resulting enrich- 
ment 'pedestals' and 'holes' can be problematic for accur- 
ately detecting binding events and calculating enrichment 
levels. Hb. NRC-1 has several IS elements and two 
plasmids that exhibit dosage variations, and so a seg- 
mented analysis was performed by providing a genomic 
map of these features in the reference genome. 

GhlP-seq coverage data and candidate peaks were 
visualized using the Gaggle Genome Browser (18). 
Shared peaks were assessed using a combination of 
BEDTools (28) and custom R scripts. To assess the 
required sequencing depth for accurate and sensitive 
binding site identification, random subsampHng of a 
6 million read TfbD GhIP and WGE control runs were 
performed. 

RESULTS 

Epitope-tagged strain construction 

We developed a protocol for rapidly engineering the 
strains of the Hb. NRC-1 with epitope-tagged target 
proteins under the control of their native promoters. 
Using different classes of transcription factors, we demon- 
strate this approach's application and utility. In bacteria 
and archaea, different transcription factors may bind a 
wide span of target sites, ranging from one to hundreds. 
We chose to collect localization data from two extreme 
cases. The general transcription factor TfbD is known to 
bind hundreds of promoters (16). As an example of a 
specific regulator of a smaller regulon, we examined the 
Bat transcription factor. This transcription factor pre- 
dicted to bind up to four potential sites and is one of 
the few haloarchaeal transcription factors with a 
well-described binding motif (29). 

Our epitope-tagging protocol for Hb. NRC-1 employed 
a homologous recombination method analogous to the 
gene deletion strategy developed by Peck et al. (30). This 
epitope knock-in strategy was described by Zhang et al. 
for GhlP-chip appHcations in human somatic cell Hnes 
(31). For tfbD::HA strain construction, the novel 
vector pRSKOl, which is compatible with Gateway 
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ligation-independent cloning (Invitrogen, Carlsbad, CA), 
was created to facilitate and further standardize strain 
construction. This new vector allowed the use of either 
commercial DNA synthesis technology or PCR-mediated 
SOEing (23) with Gateway compatible primers to rapidly 
construct the vectors used for strain construction. While in 
most instances PCR SOEing is still less expensive, the 
decreasing cost of DNA synthesis should make this 
approach a more attractive alternative for future 
large-scale strain construction projects. 

Hb. NRC-1 ApyrF transformed with a recombinant 
plasmid that contained the terminal 500 bp of the target 
gene, a hemagglutinin (HA) tag, stop codon and 500 bp 
downstream sequence (Figure 1). Homologous recombin- 
ation between the chromosomal target gene and the re- 
combinant plasmid sequence introduced the HA tag to 
the chromosomal sequence. Successful first recombinants 
were determined by PCR screening of Mev^ colonies 
(Supplementary Figure SI). The plasmid was subsequently 
resolved using 5-FOA counter-selection as previously 
described (22,30). Strains with C-terminally HA-tagged 
target proteins were further verified by PCR and Sanger 
sequencing (Supplementary Figures S1-S2 and 
Supplementary Information). For the rapid construction 
of epitope-tagged transcription factor strains, this general 
strategy of utiHzing DNA synthesis and homologous 
recombination-based chromosomal modification can be 
readily extended to any organisms with a system for 
targeted genetic knockouts. 

Western blotting with anti-HA antibody confirmed 
the specificity of the ChIP assay in these chromosomally 
tagged strains (Supplementary Figure S3). The 
chromosomally integrated, epitope-tagged proteins 
remain under the control of their native promoters, as 
observed in the differential expression of the TfbD-HA 
protein over the course of growth in the recombinant 
strain ApyrF tfbD::HA (Supplementary Figure S4). 
Increase in the abundance of TfbD during stationary 
phase is consistent with previous reports of tfbD transcrip- 
tional abundance (20). This approach can be used to 
monitor dynamic changes in the GRN that occur under 
different physiological conditions. 

Identifying target protein DNA-binding sites 

We used ChlP-seq to analyze with two different classes of 
target proteins: the general transcription factor TfbD and 
the Bat using recombinant strains that natively express the 
target protein (ApyrF bat::HA and ApyrF tfbDy.HA). 
From the ChlP-seq datasets of 1.2 milHon reads for each 
factor, we identified 380 binding sites for TfbD and two 
binding sites for Bat (the brp and crtBl promoters). 

These punctate target protein DNA-binding events 
produce a distinctive bimodal, strand-specific enrichment 
pattern in sequence coverage (Figure 2). This enrichment 
pattern was leveraged to identify binding sites using our 
open source software package Pique, which reports the 
candidate binding site's enrichment as the ratio of 
sequence coverage in the IP data relative to a background 
control (Neches,R.Y., Wilbanks,E.G. and Facciotti,M.T., 
in preparation). Pique is implemented from a user-friendly 
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AAAA I kA/WAA> 

PCR screen for 
target_gene::HA recombinant 

Figure 1. Epitope tag-in approach for Hb. NRC-1. The Hb. NRC-1 
ApyrF strain is transformed with a plasmid containing the 
mevinohn-resistance determinant (Mev^; dark gray box) and the pyrF 
gene (black box) that confers 5-FOA sensitivity. The plasmid carries an 
engineered sequence containing the HA epitope sequence (white box) 
flanked by the last 500 bp of the target gene and the 500 bp downstream 
of the target gene (Hght gray boxes). Plasmid sequence is shown as solid 
line, chromosomal sequence is shown as sohd, wavy hues. Cross-over 
can occur between target gene (hght gray box) and flanking sequence 
(gray wavy hne) in the chromosome and the homologous regions in the 
plasmid sequence, at either position 1 or 2 (position 1 example shown). 
PCR screening of mevinohn-resistant colonies is used to determine suc- 
cessful first recombinants. Subsequent plating on 5-FOA selects for 
second recombinants (via counter-selection with the pyrF gene). In 
this example, a second cross-over at site 2 produces the desired 
chromosomally integrated recombinant target^ene::HA fusion. 
PCR screening of these colonies is required to distinguish this 
desired second recombinant from a second recombinant occurring at 
position 1. Drawing is not to scale. 



graphical user interface and exports predicted binding 
sites to the Gaggle Genome Browser (18). From the 
Gaggle genome browser, users can explore and curate 
the data before proceeding with analysis via other down- 
stream Gaggle tools (Figure 3). 

To confirm the specificity of the HA antibody for the 
Chip assay on the recombinant HA-tagged strains, we 
conducted ChlP-seq on the ApyrF parent strain where 
no HA tags were expressed. The data contained a single 
peak, at chromosomal position 166589, likely because of 
the presence of a similar epitope in a native protein. This 
peak was present in all datasets examined and was subse- 
quently filtered from all downstream analysis. 
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Figure 2. The 5' to y sequencing requirement and short read length 
produce stranded bias in sequence coverage. The shaded blue oval rep- 
resents the protein of interest bound to DNA (sohd black lines). Wavy 
hnes represent either sense (blue) or antisense (red) DNA fragments 
from Chip enrichment. The thicker portion of the hne indicates 
regions sequenced by short read sequencing technologies. Sequenced 
tags are aligned to a reference genome and shown below is the 
strand-specific sequence coverage at each position in the genome. 
Punctate-binding events (e.g. transcription factors) are characterized 
by well-defined strand-specific bimodahty in sequence coverage. 



Independent biological replicates of the TfbD ChlP-seq 
experiment showed good reproducibihty (82% 
overlapping binding sites). Differences between the two 
replicates are due to the smallest peaks in each dataset, 
as indicated by examining increasingly stringent enrich- 
ment thresholds (Supplementary Figure S5). Both 
binding sites in the Bat dataset and two example sites in 
the TfbD datasets (VNG906H and atp _p promoters) were 
confirmed with a ChlP-qPCR assay. The relationship 
between the quantification of ChIP enrichment at the 
binding sites determined by qPCR and sequencing was 
found to be well correlated across several experiments 
(Figure 4). The TfbD ChlP-seq binding sites agreed well 
with previously reported sites from TfbD ChlP-chip ex- 
periments (16). For both the ChlP-seq and ChlP-chip 
methods, 80% of consensus binding sites from biological 
replicates were identified by at least one of the replicates 
from the other method. 



Spatial resolution 

The spatial resolution of the ChlP-seq binding site predic- 
tion was assessed for the Bat and TfbD datasets. As the 
Bat transcription factor has a well-characterized predicted 
binding motif (29), we used the distance from our pre- 
dicted binding site to the motif center as an estimate of 
the spatial resolution. The binding sites at the brp and 
crtBl promoters were found, respectively, at 20 and 



27 bp upstream of the predicted Bat-binding motif center 
(3^ displaced). 

As there exists no well-defined binding motif for the 
general transcription factor TfbD, we used proximity to 
the nearest predicted transcript start site (TSS) as a 
measure of spatial resolution for this factor. Archaeal 
TFB proteins, homologs of the eukaryotic factor TFIIB, 
canonically bind at B-recognition elements ~30-50bp 
upstream of the TSS, in association with a 
TATA-binding protein (32). Eighty-six percent of the 
312 consensus binding sites from the TfbD ChlP-seq bio- 
logical replicates were found within 250 bp of a predicted 
TSS (in either y or 5^ direction). We measured the distance 
from each of these predicted binding sites to the nearest 
predicted TSS. These values were compared to the 
distance to TSS from previously reported TfbD 
ChlP-chip sites, determined with both 500-bp contiguous 
and 12-bp overlapped tiUng microarray (Figure 5) (16). 
The distance to TSS for the ChlP-seq binding sites 
(median = 32, mean = 51) is in agreement with the 
expected binding pattern for this general transcription 
factor, and is significantly smaller than that predicted by 
both resolutions of ChlP-chip microarray (Mann Whitney 
U-test, p value < 0.005). The variance in these distance 
measurements provides an estimate of the precision with 
which each method maps binding sites. The ChlP-seq 
dataset has significantly decreased variance relative 
to both ChlP-chip datasets (significance assessed 
by Bartlett test, p value < 0.005; samples were tested 
for normality by two-sided Kolmogorov-Smirnov 
p value < 0.0005). Taken together, these data indicate 
that the ChlP-seq assay offers improved spatial resolution 
in mapping the target protein-binding site relative to prior 
ChlP-chip assays. 

Chip assay cell number and sequencing depth 

We investigated the number of cells required for ChIP to 
produce sufficient enrichment at target protein-binding 
sites. Decreasing the number of cells in the ChIP 
reaction lowered the observed enrichment at target 
protein-binding sites; however, enrichment (>5x) was de- 
tectable for highly enriched TfbD-binding sites when as 
few as 3.50 X 10^ cells were used for ChIP, equivalent to 
1 ml of a typical culture (Figure 6A). Because of the 
overall decrease in enrichment, smaller cell number 
ChlPs were less sensitive in binding site detection but 
maintained specificity (Figure 6B). The relatively small 
volume required for this ChIP assay should enable the 
high-throughput application of this method in the 
context of dynamic binding studies by allowing for 
the repetitive sampling of numerous strains with minimal 
perturbation. 

One of the main advantages to the ChlP-seq platform 
for small microbial genomes is the abihty to decrease the 
experimental cost by multiplexing many samples in a 
single sequencing lane. We carried out an in silico 
analysis to determine the depth of sequencing necessary 
to achieve sensitive and accurate detection of binding 
sites. Sequence reads were randomly subsampled 
to decreasing coverage levels from 1.2 M reads 
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Figure 5 for further details). 



(15.4x coverage) to 10000 reads (0.13 x coverage). For the 
TfbD dataset, the number of peaks identified remained 
stable down to 500K reads (5x coverage), after which 
the sensitivity began to decrease (Figure 7). The specificity 
of binding site identification remained excellent below 
500 K reads, even though the sensitivity decreased 
(Figure 7). 

For the Bat dataset, the more strongly enriched binding 
site at the brp promoter could be detected in datasets as 
small as 50 000 reads (0.64 x coverage), whereas the 
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Figure 5. Distance from predicted TfbD-binding site for ChlP-seq 
(consensus between biological repHcates), 500-bp tihng microarray 
ChlP-Chip (consensus between biological replicates) and 12-bp tiling 
microarray ChlP-Chip experiments. The observed difference in means 
was statistically significant (Mann Whitney U-test, p value < 0.005), as 
is the observed difference in variance (Bartlett test, p value < 0.005). 



weaker binding at the crtBl promoter was undetectable 
below 150 000 reads (1.9 x coverage). No false positives 
were detected in these lower coverage datasets, with the 
exception of a single site in the 200 000 read dataset. 
Examining the effect of decreased coverage on the 
spatial resolution, we found that the automated prediction 
of the binding site remained accurate, until just before the 
site became undetectable (Figure 8). 



DISCUSSION 

We report here a workflow for the genome-wide 
mapping of archaeal natively expressed transcription 
factors using a standardized, cost-effective, high- 
throughput ChlP-seq platform. This is the first example 
of a ChlP-seq protocol for archaea. The development of 
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this high-throughput method for mapping protein-DNA 
binding events in archaea should catalyze the investigation 
of the GRNs in this third domain of Hfe. 

While major advances have been made in mapping 
the GRNs of bacteria and eukaryotes, the GRNs of 
archaea represent a nascent area of exploration, with 
only a handful of genome-wide experimental studies 
(8,16,17,33,34). Understanding regulatory mechanisms in 
archaea will greatly inform our understanding of the basic 
biology of this important domain, relevant to diverse fields 
of study from biogeochemistry to biotechnology. The de- 
velopment of improved methods for surveying archaeal 
GRNs is timely, coinciding with a new wave of archaeal 
genome sequencing that has suggested many conserved 
archaeal regulatory mechanisms (34,35). 

Archaea also possess an intriguing mosaic tran- 
scriptional apparatus that exhibits properties of both 
eukaryotic and bacterial systems. While the basal tran- 
scriptional machinery of archaea is homologous to that 



Number of reads in dataset (xlO^) 

Figure 8. Spatial resolution of binding sites calculated from randomly 
subsampled Bat ChlP-seq datasets of decreasing coverage. The distance 
from ChlPseq predicted binding site to nearest Bat-binding motif was 
calculated for both the crtBl (dark gray bars) and brp (fight gray bars) 
promoter regions. Binding sites for in the crtBl promoter could not be 
detected for datasets with 100000 reads and below (no bars in plot). 



of eukaryotes, archaeal transcriptional regulators are 
more similar to those of bacteria. Studies of archaeal tran- 
scription have provided insight into both the mechanisms 
and evolution of information processing in the three 
domains of life (32,36). Likewise, deciphering archaeal 
GRNs holds a great potential for advancing our under- 
standing of fundamental principles employed by GRNs 
across the tree of life. 

The workflow presented here is cost efficient and 
amenable to high- throughput scaling. To increase 
throughput and minimize costs, we relied on recombinant 
strains with low-profile HA epitope-tagged target proteins 
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and a standard anti-HA antibody for the ChIP assay. The 
HA epitope tag in conjunction with the well-characterized 
commercial antibody proved efficient for IP while minim- 
ally perturbing the target protein. The choice of the ster- 
ically slim HA epitope tag can prove quite important; we 
have found that the larger repeated myc epitope tag (37) 
can render some DNA-binding proteins nonfunctional. 
The HA tag does not disrupt Bat function, as seen in its 
ability to complement a bat knockout mutation and 
induce bacteriorhodopsin production (Supplementary 
Figure S6). 

A transcription factor's occupancy of possible binding 
sites depends, in part, on its concentration within the cell. 
As our ultimate goal is to map the dynamics of regulatory 
network rearrangement, native expression of the target 
transcription factor, rather than constitutive expression 
from a plasmid construct, was an important feature of 
our approach. To accomplish this, we used recombinant 
target proteins that were chromosomally integrated under 
the control of the wild-type promoter. The construction of 
these recombinant archaeal strains required the develop- 
ment of a method for generating chromosomally 
integrated recombinant proteins in Hb. NRC-1, thereby 
expanding the genetic toolbox available for this model 
archaeon. 

Previous ChlP-chip studies in Hb. NRC-1 used target 
proteins that were constitutively expressed at nonnative 
levels from a heterologous plasmid (16,33,38). The result- 
ant protein-DNA associations are, therefore, perhaps best 
viewed as lists of all possible interactions rather than a 
snapshot of protein-DNA association network under 
physiological conditions. While this approach is appropri- 
ate for some applications and can offer technical advan- 
tages, such as improving ChIP efficiency for proteins 
present in low abundance, expression of target proteins 
at nonnative levels can produce artifacts in the list of 
protein— DN A binding sites. In the simplest case, consti- 
tutive overexpression can drive transcription factor asso- 
ciation to weak or nonspecific sites without significantly 
perturbing expression. Ambiguities concerning which 
binding sites are physiologically relevant can sometimes 
be resolved by incorporating data such as transcriptomes 
and regulatory motifs in the analysis. However, the per- 
turbation of transcription factor expression can also have 
more serious consequences that cannot be easily resolved, 
such as unintended protein-protein interactions and 
changes in the cellular phenotype. Lastly, constitutive 
nonnative expression precludes investigating the 
dynamics of transcription factor association, a fundamen- 
tal aspect in understanding the relationship between the 
GRN structure and function. 

We demonstrated the application of our ChlP-seq 
protocol on two different classes of archaeal transcription 
factor: a general transcription factor with many binding 
sites (TfbD) and a more specific transcriptional activator 
(Bat). ChlP-seq data were analyzed with the user-friendly, 
open-source Pique package, designed for identifying 
protein-DNA binding events in small bacterial and 
archaeal genomes. Our bioinformatics pipeline integrates 
with the Gaggle toolkit to facihtate downstream data visu- 
alization, curation and analysis. 



The predicted binding sites were consistent between bio- 
logical repHcates and with previously pubHshed ChlP-chip 
results for TfbD. We observed few significant trends in 
the gene classes bound by TfbD, with the exception of 
gene functionally associated with GTPase activity 
{p value = 1 X 10""^). The lack of obvious functional par- 
titioning of TfbD target genes is unsurprising, given this 
factor's broad role in global transcription initiation. 
Dynamic ChlP-seq experiments under different physio- 
logical conditions would likely be an appropriate future 
method for determining the potential regulatory roles 
carried out by TfbD and other archaeal general transcrip- 
tion factors. 

The two Bat-binding sites discovered in the brp and 
crtBl promoters i^brp, and VcrtBi) were verified by qPCR 
and contain two of the four previously reported 
occurrences of the Bat regulatory motif (^top and P^/^ 
were not bound) (29). It seems initially surprising that 
Bat binding was not detected upstream of the 
bacteriorhodopsin apoprotein (bop), a gene it regulates 
(29,39-41). However, recent research has shown that Bat 
regulation of bop expression is complex and may work 
cooperatively with accessory proteins Brz and Brb (42,43). 

Interestingly, the Bat-binding motif at P^^^^ and Vchbi 
share a single nucleotide insertion relative to the two 
unbound motif sites at Vj^op and P^/^. Furthermore, the 
spacer sequence between the Bat motif and the 
TATA-box is shorter at Vj^rp and P^rr^y sites (2 and 3 bp 
spacer, respectively), relative to the unbound P^^/? and P^/^ 
motif occurrences (5 bp spacer) (29). We note that these 
small differences in the Bat motif, in concert with our 
binding data, may provide some preliminary evidence to 
suggest a way by which Bat (and potential coregulators) 
distinguishes between the four predicted binding sites in a 
condition-specific manner. 

ChlP-seq identifies protein-binding sites with fine 
spatial resolution and provides accurate estimates of 
binding site enrichment. The quantification of enrichment 
found at protein binding sites calculated by ChlP-seq was 
very similar to that determined by ChlP-qPCR (Figure 5). 
Unlike ChlP-chip enrichment values, which become 
saturated at high levels of enrichment, ChlP-seq has ex- 
cellent dynamic range, and thus provides an accurate 
metric for the level of enrichment at target protein-binding 
sites. A narrow size selection of chromatin immunopre- 
cipitated DNA fragments enhances the enrichment in 
sequence coverage at target-binding sites. The use of auto- 
mated size-selection instruments, such as the Pippin Prep® 
(Sage Science, Beverly, MA), in the preparation of 
ChlP-seq Hbraries may improve data quahty. 

The number of cells required for the ChIP assay was 
investigated to determine an optimal protocol that 
balances sensitivity of binding site detection with through- 
put and ease of sample handling. Decreasing the number 
of cells for the ChIP assay was found to decrease the en- 
richment level at target protein-binding sites, and thus the 
sensitivity of the assay. However, the more strongly 
enriched sites could still be accurately detected with 
3.50 X 10^ cells, equivalent to ~1 ml of a typical culture. 
The false-positive rate remains very low in the lower cell 
number ChlP-seq experiments. The ability to use low cell 
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counts as input makes this approach tractable for 
high-throughput assessment of the more prominent 
binding sites in the genome, though precludes develop- 
ment of an exhaustive Hst of possible protein-DNA 
interactions. 

By randomly subsampHng deeply sequenced datasets, 
we determined that the required sequence coverage for 
the sensitive detection of binding sites corresponds to 
~6.5x coverage of the complete genome (approximately 
500 K reads in Hb. NRC-1). We estimate that for the aver- 
age sequencing run on the Illumina HiSeq (80 million 
reads) and the typical bacterial and archaeal genome 
(~3Mb), 130 samples can be multiplexed per lane. With 
this level of multiplexing, the ChlP-seq assay would cost 
roughly $15 per sample. The per-sample cost is expected 
to drop even further with continuing improvements in 
the output of sequencing technologies. Relative to 
ChlP-chip, this ChlP-seq workflow greatly reduces the 
experimental cost of defining the genome-wide binding 
sites of target transcription factors while also improving 
spatial resolution. From gene tagging to data analysis, this 
workflow provides an excellent model for conducting 
large-scale, dynamic mapping of bacterial and archaeal 
gene regulator networks. 

ACCESSION NUMBERS 

All high-throughput sequencing data generated in this 
work are available via BioTorrents (http://www 
.biotorrents.net/details.php?id = 259) or from our lab 
website (www.bme.ucdavis.edu/facciotti/resources_data/ 
data/). The open access Pique package and source code 
can be obtained via github at https://github.com/ryneches/ 
pique. Primer and plasmid sequences used in this study are 
available in Supplementary Information. 
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